
Conversation

@reasonsolo
Collaborator

@reasonsolo reasonsolo commented Oct 28, 2025

Split openai_disagg_server.py into different modules:

  • FastAPI-related code stays in openai_disagg_server.py
  • request dispatching moves to openai_disagg_service.py
  • sending HTTP requests to the ctx/gen servers moves to openai_client.py
  • perf metrics move to perf_metrics.py, which also adds Prometheus metrics for disagg serving (a sketch of the resulting layering follows)
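
A minimal sketch of the resulting layering is shown below. The class and method names come from the new modules in this PR; the signatures are illustrative assumptions, not the actual ones. The idea is that openai_disagg_server.py keeps only the FastAPI wiring and delegates dispatch to an OpenAIService implementation.

# Illustrative sketch only -- method names follow the new modules, signatures are assumed.
from abc import ABC, abstractmethod

class ResponseHooks(ABC):  # responses_utils.py: per-request lifecycle callbacks for metrics/logging
    def on_req_begin(self, request): ...
    def on_ctx_resp(self, ctx_server, response): ...
    def on_first_token(self, gen_server, request): ...
    def on_resp_done(self, gen_server, request, response): ...

class OpenAIService(ABC):  # openai_service.py: FastAPI-agnostic service interface
    @abstractmethod
    async def openai_completion(self, request, hooks=None): ...
    @abstractmethod
    async def openai_chat_completion(self, request, hooks=None): ...
    @abstractmethod
    async def is_ready(self) -> bool: ...
    @abstractmethod
    async def setup(self): ...
    @abstractmethod
    async def teardown(self): ...

class OpenAIClient(ABC):  # openai_client.py: transport the service uses to reach ctx/gen servers
    @abstractmethod
    async def send_request(self, request, hooks=None): ...
    @abstractmethod
    async def check_ready(self): ...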

Summary by CodeRabbit

Release Notes

  • New Features

    • Added OpenAI-compatible HTTP client with streaming support and automatic retry logic for improved reliability.
    • Introduced disaggregated serving with separate context and generation phases for optimized performance.
    • Added performance metrics collection and health monitoring capabilities.
    • Enabled real-time worker event callbacks for dynamic cluster management.
  • Refactor

    • Streamlined server architecture with improved lifecycle management and cleaner service patterns.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
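
Example invocations, using only the flags documented above (the pipeline ID is a placeholder):

/bot run
/bot run --stage-list "A10-PyTorch-1"
/bot run --gpu-type "A30, H100_PCIe" --disable-fail-fast
/bot run --reuse-test 12345 --extra-stage "H100_PCIe-TensorRT-Post-Merge-1"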

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.
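
Example invocations for the remaining subcommands (the skip comment text is a placeholder):

/bot kill
/bot skip --comment "docs-only change, no tests affected"
/bot reuse-pipeline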

@reasonsolo reasonsolo force-pushed the TRTLLM-8920_decouplefastapi branch 3 times, most recently from 104984f to 577e9db Compare November 3, 2025 10:41
@reasonsolo reasonsolo marked this pull request as ready for review November 3, 2025 10:44
@reasonsolo reasonsolo requested review from a team as code owners November 3, 2025 10:44
@reasonsolo reasonsolo force-pushed the TRTLLM-8920_decouplefastapi branch from 577e9db to 8ea9d89 Compare November 3, 2025 10:44
@coderabbitai
Contributor

coderabbitai bot commented Nov 3, 2025

📝 Walkthrough

The changes introduce a service-oriented architecture for OpenAI disaggregated serving. New abstractions include OpenAIService base class, OpenAIClient interface with HTTP implementation, OpenAIDisaggregatedService for orchestrating context/generation flows, metrics collection infrastructure, and response hooks. Supporting utilities and routing are updated to integrate these components.

Changes

Cohort / File(s) Summary
Type Definitions & Protocols
tensorrt_llm/serve/openai_protocol.py, tensorrt_llm/serve/openai_service.py, tensorrt_llm/serve/responses_utils.py
Added type aliases UCompletionRequest/UCompletionResponse; introduced abstract OpenAIService base with lifecycle methods; added ResponseHooks abstract interface for request lifecycle callbacks; added CompletionResponseGenerator type alias and done_generator utility; enhanced ServerArrivalTimeMiddleware to populate ctx_server and gen_server in HTTP scope.
HTTP Client Abstraction
tensorrt_llm/serve/openai_client.py
Introduced abstract OpenAIClient interface with send_request, _send_request, collect_metrics, check_ready, shutdown, and _finish_request methods; implemented OpenAIHttpClient with aiohttp session management, request retry logic with backoff, streaming response handling via _response_generator, per-token latency metrics, completion callbacks through hooks, and integration with Router for server selection.
Performance Metrics Collection
tensorrt_llm/serve/perf_metrics.py
Added ClientMetricsCollector for per-client counters and histograms (total, error, retry, completed, latency); added DisaggPerfMetricsCollector for aggregating per-request metrics across clients, maintaining bounded request queue, matching generation/context metrics, and deferred processing of unfinished requests.
Disaggregated Service Implementation
tensorrt_llm/serve/openai_disagg_service.py
Introduced OpenAIDisaggregatedService (subclass of OpenAIService) orchestrating context and generation requests with conditional disaggregation, request normalization, server readiness waiting, worker-event handling, and response validation; added OpenAIDisaggregatedPreAllocService variant with parallelized context/generation execution via TaskGroup.
Server Integration & Routing
tensorrt_llm/serve/openai_disagg_server.py, tensorrt_llm/llmapi/disagg_utils.py
Refactored server to use service-oriented approach with OpenAIDisaggregatedService, replaced in-process routing with RawRequestResponseHooks for metrics collection; exposed /prometheus/metrics and /perf_metrics endpoints; added steady clock offset alignment; added helper functions for router creation and client initialization; simplified entry points via _wrap_entry_point; renamed get_ctx_gen_server_urls to get_ctx_gen_server_addrs and removed "http://" prefix from address construction.
Worker Management
tensorrt_llm/serve/disagg_auto_scaling.py
Extended DisaggClusterManager.watch_workers to accept optional on_event callback; added background task (_watch_task) to drain and forward subsequent events to callback; processes existing workers on first run; returns empty list if watch already initialized.
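
The retry-with-backoff behavior described for OpenAIHttpClient in the HTTP Client Abstraction entry above follows the usual aiohttp pattern. A minimal sketch under assumed names and signatures (not the PR's actual code):

import asyncio
import aiohttp

async def post_with_retry(session: aiohttp.ClientSession, url: str, payload: dict,
                          max_retries: int = 3, backoff_s: float = 0.5) -> dict:
    # Retry transient HTTP/connection failures with exponential backoff.
    for attempt in range(max_retries):
        try:
            async with session.post(url, json=payload) as resp:
                resp.raise_for_status()
                return await resp.json()
        except aiohttp.ClientError:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(backoff_s * (2 ** attempt))

The real client additionally streams responses via _response_generator, records per-token latency, and reports retries/errors through ClientMetricsCollector.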

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Server as OpenAI Server
    participant Service as OpenAIDisaggregatedService
    participant CtxRouter as Context Router
    participant GenRouter as Generation Router
    participant Metrics as PerfMetricsCollector

    Client->>Server: POST /v1/chat/completions
    Server->>Server: _wrap_entry_point (create hooks)
    Server->>Service: openai_chat_completion(request, hooks)
    Service->>Service: _check_conditional_disagg()
    
    rect rgb(200, 220, 255)
    Note over Service: Disaggregated Path
    Service->>CtxRouter: send_request(ctx_request)
    CtxRouter-->>Service: context_response
    Service->>Metrics: on_ctx_resp
    end
    
    Service->>Service: _need_gen(ctx_response)
    
    rect rgb(220, 200, 255)
    Note over Service: Generation Path
    Service->>GenRouter: send_request(gen_request)
    GenRouter-->>Service: generation_response (streaming)
    Service->>Metrics: on_first_token
    end
    
    Service-->>Server: CompletionResponseGenerator
    rect rgb(200, 255, 220)
    Note over Server: Stream Handling
    loop per token
        Server->>Client: token chunk
        Server->>Metrics: update latency
    end
    end
    
    Server->>Metrics: on_resp_done
    Metrics-->>Client: response complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • openai_client.py: Dense async logic with retry mechanisms, streaming response handling, and metrics integration requires careful examination of error paths and state management
  • openai_disagg_service.py: Complex orchestration logic for context/generation request coordination, conditional disaggregation paths, and worker event handling needs thorough flow validation
  • perf_metrics.py: Thread-safe metric aggregation with bounded queues and deferred processing logic; edge cases around unfinished requests require close inspection
  • openai_disagg_server.py: Large-scale refactoring from in-process to service-oriented; clock offset synchronization and metrics exposure pathways need validation
  • disagg_auto_scaling.py: Background task management and callback invocation semantics should be verified for correctness
  • Interaction between components: Verify hook invocation sequencing, metric correlation across ctx/gen servers, and error handling propagation through service layers

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Description check ❓ Inconclusive PR description lacks concrete details about the refactoring. While the PR objectives mention decoupling disagg service from fastapi and splitting modules, the PR description provided only contains template guidance without explaining the motivation, specific changes, or test coverage. Add a clear Description section explaining why this refactoring is needed and what benefits it provides. Add a Test Coverage section listing relevant tests that validate the split modules and decoupled service. Fill out the PR Checklist to confirm compliance with coding guidelines and testing requirements.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title '[TRTLLM-8920][feat] decouple disagg service from fastapi' clearly and concisely summarizes the main objective of the changes. It follows the required format with a JIRA ticket identifier, feature type, and a descriptive summary that accurately reflects the architectural refactoring shown in the raw summary—specifically the decoupling of the disaggregation service from FastAPI dependencies. The title is specific enough that reviewers can understand the primary change at a glance.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/serve/disagg_auto_scaling.py (1)

101-123: Do not overwrite the initialized watch handle

After seeding existing workers, Line 121 reassigns _watch_handle with a second watch(...) call. That overwrites the first handle, so the events you just injected into the queue are lost and downstream consumers miss the pre-existing workers. Please keep the first handle and remove the redundant watch.

-        self._watch_handle = await self._cluster_storage.watch(
-            self.worker_key_prefix)
-
-        async def on_event_wrapper():
+        async def on_event_wrapper():
🧹 Nitpick comments (1)
tensorrt_llm/serve/openai_service.py (1)

18-22: Clarify the openai_completion return contract

Line 18 claims implementations should return a tuple, but the signature on Line 17 returns Union[CompletionResponse, CompletionResponseGenerator]. This contradiction is confusing for implementers and reviewers. Please align the docstring with the declared type (or update the type) so the contract is unambiguous.

Apply this diff to the docstring:

@@
-        """
-        Return a tuple of (completion response, async completion response generator)
-        When request is streaming, the generator will be used to stream the response.
-        When request is not streaming, the generator will be ignore and the response will be returned directly.
-        """
+        """Return either a CompletionResponse or a CompletionResponseGenerator.
+
+        Implementations should yield the generator when `request.stream` is true
+        and otherwise return the complete response object directly.
+        """
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8303cfa and 8ea9d89.

📒 Files selected for processing (9)
  • tensorrt_llm/llmapi/disagg_utils.py (1 hunks)
  • tensorrt_llm/serve/disagg_auto_scaling.py (4 hunks)
  • tensorrt_llm/serve/openai_client.py (1 hunks)
  • tensorrt_llm/serve/openai_disagg_server.py (3 hunks)
  • tensorrt_llm/serve/openai_disagg_service.py (1 hunks)
  • tensorrt_llm/serve/openai_protocol.py (1 hunks)
  • tensorrt_llm/serve/openai_service.py (1 hunks)
  • tensorrt_llm/serve/perf_metrics.py (1 hunks)
  • tensorrt_llm/serve/responses_utils.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/llmapi/disagg_utils.py
  • tensorrt_llm/serve/responses_utils.py
  • tensorrt_llm/serve/openai_protocol.py
  • tensorrt_llm/serve/openai_service.py
  • tensorrt_llm/serve/disagg_auto_scaling.py
  • tensorrt_llm/serve/openai_disagg_service.py
  • tensorrt_llm/serve/perf_metrics.py
  • tensorrt_llm/serve/openai_client.py
  • tensorrt_llm/serve/openai_disagg_server.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/llmapi/disagg_utils.py
  • tensorrt_llm/serve/responses_utils.py
  • tensorrt_llm/serve/openai_protocol.py
  • tensorrt_llm/serve/openai_service.py
  • tensorrt_llm/serve/disagg_auto_scaling.py
  • tensorrt_llm/serve/openai_disagg_service.py
  • tensorrt_llm/serve/perf_metrics.py
  • tensorrt_llm/serve/openai_client.py
  • tensorrt_llm/serve/openai_disagg_server.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/llmapi/disagg_utils.py
  • tensorrt_llm/serve/responses_utils.py
  • tensorrt_llm/serve/openai_protocol.py
  • tensorrt_llm/serve/openai_service.py
  • tensorrt_llm/serve/disagg_auto_scaling.py
  • tensorrt_llm/serve/openai_disagg_service.py
  • tensorrt_llm/serve/perf_metrics.py
  • tensorrt_llm/serve/openai_client.py
  • tensorrt_llm/serve/openai_disagg_server.py
🧬 Code graph analysis (7)
tensorrt_llm/serve/responses_utils.py (1)
tensorrt_llm/serve/openai_disagg_server.py (4)
  • on_req_begin (47-48)
  • on_ctx_resp (50-51)
  • on_first_token (53-55)
  • on_resp_done (57-59)
tensorrt_llm/serve/openai_service.py (3)
tensorrt_llm/serve/openai_protocol.py (4)
  • ChatCompletionRequest (500-717)
  • ChatCompletionResponse (441-450)
  • CompletionRequest (210-341)
  • CompletionResponse (146-155)
tensorrt_llm/serve/openai_disagg_service.py (5)
  • openai_completion (61-77)
  • openai_chat_completion (79-84)
  • is_ready (183-186)
  • setup (196-217)
  • teardown (219-228)
tensorrt_llm/serve/disagg_auto_scaling.py (1)
  • is_ready (224-225)
tensorrt_llm/serve/disagg_auto_scaling.py (2)
tensorrt_llm/serve/cluster_storage.py (10)
  • WatchEventType (26-28)
  • watch (94-95)
  • watch (239-248)
  • watch (388-390)
  • watch (533-546)
  • drain (43-51)
  • unwatch (98-99)
  • unwatch (250-257)
  • unwatch (392-394)
  • unwatch (548-551)
tensorrt_llm/logger.py (2)
  • error (126-127)
  • warning (132-133)
tensorrt_llm/serve/openai_disagg_service.py (8)
tensorrt_llm/llmapi/disagg_utils.py (4)
  • ConditionalDisaggConfig (42-43)
  • DisaggClusterConfig (59-64)
  • DisaggServerConfig (68-78)
  • ServerRole (19-22)
tensorrt_llm/serve/cluster_storage.py (2)
  • ClusterStorage (60-106)
  • WatchEventType (26-28)
tensorrt_llm/serve/disagg_auto_scaling.py (9)
  • DisaggClusterManager (32-229)
  • WorkerInfo (17-21)
  • is_ready (224-225)
  • cluster_info (65-82)
  • start (58-59)
  • watch_workers (96-143)
  • stop (61-63)
  • worker_info (265-269)
  • worker_id (261-262)
tensorrt_llm/serve/openai_client.py (4)
  • OpenAIClient (28-70)
  • send_request (29-41)
  • check_ready (61-63)
  • check_ready (236-249)
tensorrt_llm/serve/openai_protocol.py (3)
  • ChatCompletionRequest (500-717)
  • CompletionRequest (210-341)
  • DisaggregatedParams (104-109)
tensorrt_llm/serve/openai_service.py (6)
  • OpenAIService (13-39)
  • openai_completion (15-23)
  • is_ready (33-33)
  • openai_chat_completion (26-30)
  • setup (36-36)
  • teardown (39-39)
tensorrt_llm/serve/responses_utils.py (3)
  • ResponseHooks (894-919)
  • done_generator (922-923)
  • on_req_begin (900-901)
tensorrt_llm/serve/router.py (5)
  • KvCacheAwareRouter (541-647)
  • Router (146-410)
  • start_server_monitoring (204-216)
  • stop_server_monitoring (218-231)
  • remove_server (183-194)
tensorrt_llm/serve/perf_metrics.py (1)
tensorrt_llm/serve/openai_client.py (2)
  • collect_metrics (58-58)
  • collect_metrics (223-231)
tensorrt_llm/serve/openai_client.py (4)
tensorrt_llm/serve/openai_protocol.py (4)
  • ChatCompletionRequest (500-717)
  • ChatCompletionResponse (441-450)
  • CompletionRequest (210-341)
  • CompletionResponse (146-155)
tensorrt_llm/serve/perf_metrics.py (5)
  • ClientMetricsCollector (42-60)
  • inc (56-57)
  • inc (150-151)
  • observe (59-60)
  • observe (153-154)
tensorrt_llm/serve/responses_utils.py (5)
  • ResponseHooks (894-919)
  • get_steady_clock_now_in_seconds (86-87)
  • on_ctx_resp (904-905)
  • on_first_token (908-912)
  • on_resp_done (915-919)
tensorrt_llm/serve/router.py (1)
  • Router (146-410)
tensorrt_llm/serve/openai_disagg_server.py (8)
tensorrt_llm/executor/executor.py (1)
  • CppExecutorError (60-68)
tensorrt_llm/llmapi/disagg_utils.py (3)
  • DisaggServerConfig (68-78)
  • MetadataServerConfig (82-87)
  • get_ctx_gen_server_addrs (90-101)
tensorrt_llm/serve/cluster_storage.py (4)
  • HttpClusterStorageServer (142-296)
  • create_cluster_storage (109-114)
  • client (464-465)
  • add_routes (158-167)
tensorrt_llm/serve/openai_client.py (2)
  • OpenAIClient (28-70)
  • OpenAIHttpClient (73-249)
tensorrt_llm/serve/openai_disagg_service.py (8)
  • OpenAIDisaggregatedService (34-271)
  • disagg_cluster_config (189-190)
  • setup (196-217)
  • teardown (219-228)
  • openai_completion (61-77)
  • openai_chat_completion (79-84)
  • cluster_info (177-181)
  • is_ready (183-186)
tensorrt_llm/serve/responses_utils.py (6)
  • ResponseHooks (894-919)
  • get_steady_clock_now_in_seconds (86-87)
  • on_req_begin (900-901)
  • on_ctx_resp (904-905)
  • on_first_token (908-912)
  • on_resp_done (915-919)
tensorrt_llm/serve/perf_metrics.py (4)
  • DisaggPerfMetricsCollector (63-154)
  • add_per_request_metrics (74-91)
  • add_client (71-72)
  • get_perf_metrics (93-148)
tensorrt_llm/serve/router.py (2)
  • Router (146-410)
  • create_router (650-685)
🪛 Ruff (0.14.2)
tensorrt_llm/serve/disagg_auto_scaling.py

136-136: Do not catch blind exception: Exception

(BLE001)

tensorrt_llm/serve/openai_disagg_service.py

40-40: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


65-65: Avoid specifying long messages outside the exception class

(TRY003)


73-75: Avoid specifying long messages outside the exception class

(TRY003)


83-83: Avoid specifying long messages outside the exception class

(TRY003)


249-249: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


249-249: Avoid specifying long messages outside the exception class

(TRY003)


264-266: Avoid specifying long messages outside the exception class

(TRY003)


268-268: Avoid specifying long messages outside the exception class

(TRY003)


270-270: Avoid specifying long messages outside the exception class

(TRY003)

tensorrt_llm/serve/openai_client.py

41-41: Prefer TypeError exception for invalid type

(TRY004)


41-41: Avoid specifying long messages outside the exception class

(TRY003)


65-65: OpenAIClient.shutdown is an empty method in an abstract base class, but has no abstract decorator

(B027)


80-80: Unused method argument: perf_metrics_collector

(ARG002)


139-141: Abstract raise to an inner function

(TRY301)


139-141: Avoid specifying long messages outside the exception class

(TRY003)


229-230: try-except-continue detected, consider logging the exception

(S112)


229-229: Do not catch blind exception: Exception

(BLE001)


241-241: Do not catch blind exception: Exception

(BLE001)


247-247: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


248-248: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

tensorrt_llm/serve/openai_disagg_server.py

17-17: Redefinition of unused CppExecutorError from line 16

(F811)


47-47: Unused method argument: request

(ARG002)


50-50: Unused method argument: response

(ARG002)


53-53: Unused method argument: request

(ARG002)


53-53: Unused method argument: response

(ARG002)


57-57: Unused method argument: response

(ARG002)


59-59: Store a reference to the return value of asyncio.create_task

(RUF006)


93-93: Do not catch blind exception: Exception

(BLE001)


98-98: Unused function argument: app

(ARG001)


141-141: Do not catch blind exception: Exception

(BLE001)


154-154: Use explicit conversion flag

Replace with conversion flag

(RUF010)
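
For reference, the recurring Ruff findings above map to small mechanical fixes. Illustrative snippets (all names are hypothetical, not taken from this PR):

from typing import Optional

# RUF013: spell out the implicit Optional instead of `timeout: float = None`
def fetch(url: str, timeout: Optional[float] = None) -> None: ...

# B904: chain exceptions raised from inside an except block
def parse_port(raw: str) -> int:
    try:
        return int(raw)
    except ValueError as err:
        raise RuntimeError(f"invalid port: {raw}") from err

# B905: make zip() length handling explicit (strict= requires Python 3.10+)
names = ["ctx", "gen"]
ports = [8001, 8002]
pairs = list(zip(names, ports, strict=True))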

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/serve/responses_utils.py (1)

894-919: Hooks abstraction looks solid

The lifecycle interface gives us the right touchpoints for the disagg instrumentation work. Nice addition.

@coderabbitai
Contributor

coderabbitai bot commented Nov 3, 2025

📝 Walkthrough

This PR introduces a disaggregated OpenAI serving architecture with service-oriented design. It adds HTTP client abstractions, performance metrics collection, event-driven worker management, and new service orchestration modules to coordinate context and generation server workflows.

Changes

Cohort / File(s) Summary
Core API and Utilities
tensorrt_llm/llmapi/disagg_utils.py
Renamed get_ctx_gen_server_urls() to get_ctx_gen_server_addrs() and changed address format from "http://host:port" to "host:port".
Worker Management
tensorrt_llm/serve/disagg_auto_scaling.py
Added optional on_event callback support to watch_workers() for real-time worker lifecycle events; introduced _watch_task attribute and guard check in get_worker_events().
Protocol and Base Types
tensorrt_llm/serve/openai_protocol.py, tensorrt_llm/serve/openai_service.py, tensorrt_llm/serve/responses_utils.py
Added type aliases UCompletionRequest and UCompletionResponse; introduced abstract OpenAIService interface with completion/chat-completion and lifecycle methods; added ResponseHooks abstract interface with four lifecycle callbacks, done_generator() function, and CompletionResponseGenerator type alias. Extended ASGI middleware to populate ctx_server and gen_server scope fields.
HTTP Client
tensorrt_llm/serve/openai_client.py
Introduced abstract OpenAIClient base class and concrete OpenAIHttpClient implementation with retry logic, streaming/non-streaming response handling, metrics integration, and health/metrics endpoints.
Performance Metrics
tensorrt_llm/serve/perf_metrics.py
Added ClientMetricsCollector and DisaggPerfMetricsCollector classes for per-request and aggregated metrics tracking across context/generation servers with async-safe coordination.
Service Orchestration
tensorrt_llm/serve/openai_disagg_service.py
New module with OpenAIDisaggregatedService and OpenAIDisaggregatedPreAllocService providing core disaggregated request dispatch, conditional server routing, readiness checks, and worker event handling.
Server Implementation
tensorrt_llm/serve/openai_disagg_server.py
Major refactor replacing monolithic logic with streamlined OpenAIDisaggServer using service factory pattern; introduced RawRequestResponseHooks for per-request metrics; refactored route registration, health/version endpoints, and clock offset logic with new Prometheus metrics integration.
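
The per-client counters and latency histogram described in the Performance Metrics entry above suggest a thin wrapper over the standard prometheus_client package. A hedged sketch (metric and class names are assumptions, not the PR's code):

from prometheus_client import Counter, Histogram

# Module-level metric families; per-client collectors differ only by label value.
REQUEST_COUNTER = Counter("disagg_client_requests_total",
                          "Requests sent to a downstream ctx/gen server",
                          ["client", "status"])  # status: total / error / retry / completed
FIRST_TOKEN_LATENCY = Histogram("disagg_client_first_token_seconds",
                                "Time to first token per request", ["client"])

class ClientMetricsCollectorSketch:
    def __init__(self, client_name: str):
        self._client_name = client_name

    def inc(self, status: str) -> None:
        REQUEST_COUNTER.labels(client=self._client_name, status=status).inc()

    def observe(self, latency_s: float) -> None:
        FIRST_TOKEN_LATENCY.labels(client=self._client_name).observe(latency_s)

The /prometheus/metrics endpoint would then only need to expose the default registry (for example via prometheus_client.generate_latest).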

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Server as OpenAIDisaggServer
    participant Service as OpenAIDisaggregatedService
    participant CtxClient as OpenAIHttpClient<br/>(Context)
    participant GenClient as OpenAIHttpClient<br/>(Generation)
    participant Metrics as DisaggPerfMetricsCollector
    participant Router as Router

    Client->>Server: POST /v1/completions
    Note over Server: Wrap with RawRequestResponseHooks
    Server->>Service: openai_completion(request)
    
    Service->>Service: Check readiness
    Service->>Router: Get available context server
    
    rect rgb(200, 220, 255)
        Note over CtxClient: Context Phase
        Service->>CtxClient: send_request(ctx_server, request)
        CtxClient->>CtxClient: POST with retry logic
        CtxClient->>Metrics: Track metrics
        CtxClient-->>Service: Context response + disagg_params
    end
    
    Service->>Service: _check_conditional_disagg()
    Service->>Router: Select generation server
    
    rect rgb(220, 200, 255)
        Note over GenClient: Generation Phase
        Service->>GenClient: send_request(gen_server, request)
        GenClient->>GenClient: Stream response
        GenClient->>Metrics: Track per-token metrics
        GenClient-->>Service: Streaming events
    end
    
    Service-->>Server: CompletionResponseGenerator
    Server-->>Client: Streaming response with hooks

    Metrics->>Metrics: Aggregate metrics by request_id

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

  • New module structure: Three new service modules (openai_disagg_service.py, openai_client.py, perf_metrics.py) introduce substantial new logic requiring separate reasoning for each component.
  • Architectural refactor: openai_disagg_server.py undergoes significant restructuring from monolithic to service-oriented design; public API surface changes require careful validation against existing callers.
  • Event-driven callbacks: disagg_auto_scaling.py introduces async callback chains with task management; concurrent event handling requires careful review.
  • Metrics aggregation complexity: DisaggPerfMetricsCollector coordinates async state across multiple clients with per-request key management and capacity-bounded deques; potential race conditions and resource leaks warrant scrutiny.
  • HTTP client retry and streaming logic: OpenAIHttpClient._post_with_retry() and _response_generator() handle streaming responses with metrics hooks at multiple points; edge cases in response parsing and hook invocation need validation.
  • Breaking public API changes: Function rename in disagg_utils.py and class constructor signature changes in openai_disagg_server.py require impact analysis.

Areas requiring extra attention:

  • Concurrent task management in disagg_auto_scaling.py::watch_workers() with _watch_task lifecycle
  • Async coordination and lock usage patterns in DisaggPerfMetricsCollector to prevent deadlocks
  • Streaming response handling edge cases in OpenAIHttpClient._response_generator() (partial frames, encoding issues)
  • Router/metadata server initialization sequence in openai_disagg_server.py::setup()
  • Conditional disaggregation logic in OpenAIDisaggregatedService._check_conditional_disagg()
  • Metrics hook invocation points and their impact on request latency

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is largely incomplete and does not follow the repository's PR template effectively. While the template sections for Description and Test Coverage are present, they contain only comments with no substantive content provided by the author. The author merely states 'Still working on a hang issue with streaming requests, but open for review to speed up the review process,' which is insufficient to explain the changes. The PR Checklist is mostly empty (only one checkbox marked), and the description does not clearly explain what changes were made, why they were made, or what test coverage exists for the extensive modifications across multiple new modules and significant refactoring. The author should complete the PR description by: (1) providing a clear, detailed explanation of the disagg service decoupling from FastAPI, including architectural changes and benefits; (2) listing the relevant test cases that validate the new abstractions and implementations; (3) completing the PR checklist items to confirm adherence to coding guidelines, test coverage, and documentation updates; and (4) removing or clarifying the mention of an unresolved 'hang issue' to indicate readiness for review.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title '[TRTLLM-8920][feat] decouple disagg service from fastapi' is clear, specific, and directly relates to the primary change across the files. It accurately reflects the main objective: decoupling the disaggregated service architecture from FastAPI by introducing abstract service interfaces (OpenAIService), abstract client interfaces (OpenAIClient), and new service implementations (OpenAIDisaggregatedService, OpenAIHttpClient) that enable framework-independent request handling. The title follows the repository's naming convention with a JIRA ticket and feature type tag.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 17

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8303cfa and 8ea9d89.

📒 Files selected for processing (9)
  • tensorrt_llm/llmapi/disagg_utils.py (1 hunks)
  • tensorrt_llm/serve/disagg_auto_scaling.py (4 hunks)
  • tensorrt_llm/serve/openai_client.py (1 hunks)
  • tensorrt_llm/serve/openai_disagg_server.py (3 hunks)
  • tensorrt_llm/serve/openai_disagg_service.py (1 hunks)
  • tensorrt_llm/serve/openai_protocol.py (1 hunks)
  • tensorrt_llm/serve/openai_service.py (1 hunks)
  • tensorrt_llm/serve/perf_metrics.py (1 hunks)
  • tensorrt_llm/serve/responses_utils.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/serve/openai_protocol.py
  • tensorrt_llm/llmapi/disagg_utils.py
  • tensorrt_llm/serve/openai_service.py
  • tensorrt_llm/serve/responses_utils.py
  • tensorrt_llm/serve/openai_disagg_server.py
  • tensorrt_llm/serve/perf_metrics.py
  • tensorrt_llm/serve/openai_disagg_service.py
  • tensorrt_llm/serve/openai_client.py
  • tensorrt_llm/serve/disagg_auto_scaling.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/serve/openai_protocol.py
  • tensorrt_llm/llmapi/disagg_utils.py
  • tensorrt_llm/serve/openai_service.py
  • tensorrt_llm/serve/responses_utils.py
  • tensorrt_llm/serve/openai_disagg_server.py
  • tensorrt_llm/serve/perf_metrics.py
  • tensorrt_llm/serve/openai_disagg_service.py
  • tensorrt_llm/serve/openai_client.py
  • tensorrt_llm/serve/disagg_auto_scaling.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/serve/openai_protocol.py
  • tensorrt_llm/llmapi/disagg_utils.py
  • tensorrt_llm/serve/openai_service.py
  • tensorrt_llm/serve/responses_utils.py
  • tensorrt_llm/serve/openai_disagg_server.py
  • tensorrt_llm/serve/perf_metrics.py
  • tensorrt_llm/serve/openai_disagg_service.py
  • tensorrt_llm/serve/openai_client.py
  • tensorrt_llm/serve/disagg_auto_scaling.py
🧬 Code graph analysis (7)
tensorrt_llm/serve/openai_service.py (3)
tensorrt_llm/serve/openai_protocol.py (4)
  • ChatCompletionRequest (500-717)
  • ChatCompletionResponse (441-450)
  • CompletionRequest (210-341)
  • CompletionResponse (146-155)
tensorrt_llm/serve/openai_disagg_service.py (5)
  • openai_completion (61-77)
  • openai_chat_completion (79-84)
  • is_ready (183-186)
  • setup (196-217)
  • teardown (219-228)
tensorrt_llm/serve/disagg_auto_scaling.py (1)
  • is_ready (224-225)
tensorrt_llm/serve/responses_utils.py (2)
tensorrt_llm/serve/openai_protocol.py (1)
  • ResponsesResponse (846-911)
tensorrt_llm/serve/openai_disagg_server.py (4)
  • on_req_begin (47-48)
  • on_ctx_resp (50-51)
  • on_first_token (53-55)
  • on_resp_done (57-59)
tensorrt_llm/serve/openai_disagg_server.py (8)
tensorrt_llm/executor/executor.py (1)
  • CppExecutorError (60-68)
tensorrt_llm/llmapi/disagg_utils.py (3)
  • DisaggServerConfig (68-78)
  • MetadataServerConfig (82-87)
  • get_ctx_gen_server_addrs (90-101)
tensorrt_llm/serve/cluster_storage.py (4)
  • HttpClusterStorageServer (142-296)
  • create_cluster_storage (109-114)
  • client (464-465)
  • add_routes (158-167)
tensorrt_llm/serve/openai_client.py (2)
  • OpenAIClient (28-70)
  • OpenAIHttpClient (73-249)
tensorrt_llm/serve/openai_disagg_service.py (7)
  • OpenAIDisaggregatedService (34-271)
  • disagg_cluster_config (189-190)
  • setup (196-217)
  • teardown (219-228)
  • openai_completion (61-77)
  • cluster_info (177-181)
  • is_ready (183-186)
tensorrt_llm/serve/responses_utils.py (6)
  • ResponseHooks (894-919)
  • get_steady_clock_now_in_seconds (86-87)
  • on_req_begin (900-901)
  • on_ctx_resp (904-905)
  • on_first_token (908-912)
  • on_resp_done (915-919)
tensorrt_llm/serve/perf_metrics.py (4)
  • DisaggPerfMetricsCollector (63-154)
  • add_per_request_metrics (74-91)
  • add_client (71-72)
  • get_perf_metrics (93-148)
tensorrt_llm/serve/router.py (2)
  • Router (146-410)
  • create_router (650-685)
tensorrt_llm/serve/perf_metrics.py (1)
tensorrt_llm/serve/openai_client.py (2)
  • collect_metrics (58-58)
  • collect_metrics (223-231)
tensorrt_llm/serve/openai_disagg_service.py (8)
tensorrt_llm/llmapi/disagg_utils.py (3)
  • ConditionalDisaggConfig (42-43)
  • DisaggClusterConfig (59-64)
  • ServerRole (19-22)
tensorrt_llm/serve/cluster_storage.py (2)
  • ClusterStorage (60-106)
  • WatchEventType (26-28)
tensorrt_llm/serve/disagg_auto_scaling.py (9)
  • DisaggClusterManager (32-229)
  • WorkerInfo (17-21)
  • is_ready (224-225)
  • cluster_info (65-82)
  • start (58-59)
  • watch_workers (96-143)
  • stop (61-63)
  • worker_info (265-269)
  • worker_id (261-262)
tensorrt_llm/serve/openai_client.py (6)
  • OpenAIClient (28-70)
  • send_request (29-41)
  • shutdown (65-65)
  • shutdown (233-234)
  • check_ready (61-63)
  • check_ready (236-249)
tensorrt_llm/serve/openai_protocol.py (3)
  • ChatCompletionRequest (500-717)
  • CompletionRequest (210-341)
  • DisaggregatedParams (104-109)
tensorrt_llm/serve/openai_service.py (6)
  • OpenAIService (13-39)
  • openai_completion (15-23)
  • is_ready (33-33)
  • openai_chat_completion (26-30)
  • setup (36-36)
  • teardown (39-39)
tensorrt_llm/serve/responses_utils.py (3)
  • ResponseHooks (894-919)
  • done_generator (922-923)
  • on_req_begin (900-901)
tensorrt_llm/serve/router.py (5)
  • KvCacheAwareRouter (541-647)
  • Router (146-410)
  • start_server_monitoring (204-216)
  • stop_server_monitoring (218-231)
  • remove_server (183-194)
tensorrt_llm/serve/openai_client.py (5)
tensorrt_llm/serve/openai_protocol.py (4)
  • ChatCompletionRequest (500-717)
  • ChatCompletionResponse (441-450)
  • CompletionRequest (210-341)
  • CompletionResponse (146-155)
tensorrt_llm/serve/perf_metrics.py (6)
  • ClientMetricsCollector (42-60)
  • DisaggPerfMetricsCollector (63-154)
  • inc (56-57)
  • inc (150-151)
  • observe (59-60)
  • observe (153-154)
tensorrt_llm/serve/responses_utils.py (5)
  • ResponseHooks (894-919)
  • get_steady_clock_now_in_seconds (86-87)
  • on_ctx_resp (904-905)
  • on_first_token (908-912)
  • on_resp_done (915-919)
tensorrt_llm/serve/router.py (1)
  • Router (146-410)
tensorrt_llm/serve/openai_disagg_server.py (3)
  • on_ctx_resp (50-51)
  • on_first_token (53-55)
  • on_resp_done (57-59)
tensorrt_llm/serve/disagg_auto_scaling.py (2)
tensorrt_llm/serve/cluster_storage.py (10)
  • WatchEventType (26-28)
  • watch (94-95)
  • watch (239-248)
  • watch (388-390)
  • watch (533-546)
  • drain (43-51)
  • unwatch (98-99)
  • unwatch (250-257)
  • unwatch (392-394)
  • unwatch (548-551)
tensorrt_llm/logger.py (2)
  • error (126-127)
  • warning (132-133)
🪛 GitHub Actions: Release Checks
tensorrt_llm/serve/openai_service.py

[error] 18-21: D205 1 blank line required between summary line and description

tensorrt_llm/serve/openai_disagg_service.py

[error] 89-93: D205 1 blank line required between summary line and description


[error] 89-89: D415 First line should end with a period, question mark, or exclamation point


[error] 91-91: E501 Line too long (122 > 120)

tensorrt_llm/serve/openai_client.py

[error] 52-52: D205 1 blank line required between summary line and description


[error] 62-62: D415 First line should end with a period, question mark, or exclamation point


[error] 111-111: F821 Undefined name anext
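
The F821 finding is worth calling out: anext() only became a builtin in Python 3.10, while the coding guidelines above target Python 3.8+. A 3.8-compatible fallback looks roughly like this (illustrative helper, not the PR's actual fix):

async def first_item(agen):
    # Equivalent to `await anext(agen, None)` on Python 3.10+.
    try:
        return await agen.__anext__()
    except StopAsyncIteration:
        return None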

🪛 Ruff (0.14.2)
tensorrt_llm/serve/openai_disagg_server.py

17-17: Redefinition of unused CppExecutorError from line 16

(F811)


47-47: Unused method argument: request

(ARG002)


50-50: Unused method argument: response

(ARG002)


53-53: Unused method argument: request

(ARG002)


53-53: Unused method argument: response

(ARG002)


57-57: Unused method argument: response

(ARG002)


59-59: Store a reference to the return value of asyncio.create_task

(RUF006)


93-93: Do not catch blind exception: Exception

(BLE001)


98-98: Unused function argument: app

(ARG001)


141-141: Do not catch blind exception: Exception

(BLE001)


154-154: Use explicit conversion flag

Replace with conversion flag

(RUF010)

tensorrt_llm/serve/openai_disagg_service.py

40-40: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


65-65: Avoid specifying long messages outside the exception class

(TRY003)


73-75: Avoid specifying long messages outside the exception class

(TRY003)


83-83: Avoid specifying long messages outside the exception class

(TRY003)


249-249: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


249-249: Avoid specifying long messages outside the exception class

(TRY003)


264-266: Avoid specifying long messages outside the exception class

(TRY003)


268-268: Avoid specifying long messages outside the exception class

(TRY003)


270-270: Avoid specifying long messages outside the exception class

(TRY003)

tensorrt_llm/serve/openai_client.py

41-41: Prefer TypeError exception for invalid type

(TRY004)


41-41: Avoid specifying long messages outside the exception class

(TRY003)


65-65: OpenAIClient.shutdown is an empty method in an abstract base class, but has no abstract decorator

(B027)


80-80: Unused method argument: perf_metrics_collector

(ARG002)


139-141: Abstract raise to an inner function

(TRY301)


139-141: Avoid specifying long messages outside the exception class

(TRY003)


229-230: try-except-continue detected, consider logging the exception

(S112)


229-229: Do not catch blind exception: Exception

(BLE001)


241-241: Do not catch blind exception: Exception

(BLE001)


247-247: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


248-248: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

tensorrt_llm/serve/disagg_auto_scaling.py

136-136: Do not catch blind exception: Exception

(BLE001)

@reasonsolo reasonsolo force-pushed the TRTLLM-8920_decouplefastapi branch from 8ea9d89 to 50348fd Compare November 4, 2025 07:56
@reasonsolo
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #23487 [ run ] triggered by Bot. Commit: 50348fd

@tensorrt-cicd
Collaborator

PR_Github #23487 [ run ] completed with state FAILURE. Commit: 50348fd
/LLM/main/L0_MergeRequest_PR pipeline #17680 completed with status: 'FAILURE'

@reasonsolo reasonsolo force-pushed the TRTLLM-8920_decouplefastapi branch from 50348fd to 57451e6 Compare November 4, 2025 09:27
@reasonsolo
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #23496 [ run ] triggered by Bot. Commit: 57451e6

@tensorrt-cicd
Collaborator

PR_Github #23496 [ run ] completed with state SUCCESS. Commit: 57451e6
/LLM/main/L0_MergeRequest_PR pipeline #17687 completed with status: 'FAILURE'

@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #23576 [ run ] triggered by Bot. Commit: b419e60

@tensorrt-cicd
Collaborator

PR_Github #23576 [ run ] completed with state SUCCESS. Commit: b419e60
/LLM/main/L0_MergeRequest_PR pipeline #17741 completed with status: 'FAILURE'

@reasonsolo reasonsolo force-pushed the TRTLLM-8920_decouplefastapi branch from b419e60 to ade914e Compare November 5, 2025 10:26
@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #23640 [ run ] triggered by Bot. Commit: ade914e

@tensorrt-cicd
Collaborator

PR_Github #24129 [ run ] completed with state SUCCESS. Commit: 192a36e
/LLM/main/L0_MergeRequest_PR pipeline #18190 completed with status: 'FAILURE'

@reasonsolo
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24161 [ run ] triggered by Bot. Commit: 192a36e

@reasonsolo reasonsolo enabled auto-merge (squash) November 11, 2025 10:10
@tensorrt-cicd
Collaborator

PR_Github #24161 [ run ] completed with state SUCCESS. Commit: 192a36e
/LLM/main/L0_MergeRequest_PR pipeline #18217 completed with status: 'FAILURE'

@reasonsolo reasonsolo force-pushed the TRTLLM-8920_decouplefastapi branch from 192a36e to cf7deb1 Compare November 11, 2025 14:34
@reasonsolo
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24193 [ run ] triggered by Bot. Commit: cf7deb1

@tensorrt-cicd
Collaborator

PR_Github #24193 [ run ] completed with state SUCCESS. Commit: cf7deb1
/LLM/main/L0_MergeRequest_PR pipeline #18243 completed with status: 'FAILURE'

@reasonsolo
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24264 [ run ] triggered by Bot. Commit: cf7deb1

@tensorrt-cicd
Collaborator

PR_Github #24264 [ run ] completed with state SUCCESS. Commit: cf7deb1
/LLM/main/L0_MergeRequest_PR pipeline #18305 completed with status: 'FAILURE'

Signed-off-by: Lizhi Zhou <[email protected]>
Signed-off-by: Lizhi Zhou <[email protected]>
Signed-off-by: Lizhi Zhou <[email protected]>
Signed-off-by: Lizhi Zhou <[email protected]>
Signed-off-by: Lizhi Zhou <[email protected]>
@reasonsolo reasonsolo force-pushed the TRTLLM-8920_decouplefastapi branch from cf7deb1 to 4541c5a Compare November 13, 2025 00:43
@reasonsolo
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24349 [ run ] triggered by Bot. Commit: 4541c5a

@tensorrt-cicd
Collaborator

PR_Github #24349 [ run ] completed with state SUCCESS. Commit: 4541c5a
/LLM/main/L0_MergeRequest_PR pipeline #18377 completed with status: 'FAILURE'

@reasonsolo
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24391 [ run ] triggered by Bot. Commit: 4541c5a

@tensorrt-cicd
Collaborator

PR_Github #24391 [ run ] completed with state SUCCESS. Commit: 4541c5a
/LLM/main/L0_MergeRequest_PR pipeline #18405 completed with status: 'FAILURE'

@reasonsolo
Collaborator Author

/bot run

@reasonsolo reasonsolo disabled auto-merge November 14, 2025 00:58
@reasonsolo
Collaborator Author

Holding this up to avoid introducing more disagg-serving changes, which have caused a lot of CI failures recently.

@tensorrt-cicd
Collaborator

PR_Github #24524 [ run ] triggered by Bot. Commit: 4541c5a

@tensorrt-cicd
Collaborator

PR_Github #24524 [ run ] completed with state SUCCESS. Commit: 4541c5a
/LLM/main/L0_MergeRequest_PR pipeline #18510 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

